Abstract

Algorithm selection is typically based on models of algorithm performance, learned during a separate 
offline training sequence, which can be prohibitively expensive. In recent work, we adopted an online 
approach, in which a performance model is iteratively updated and used to guide selection on a sequence 
of problem instances. The resulting exploration-exploitation trade-off was represented as a bandit prob- 
lem with expert advice, using an existing solver for this game, but this required the setting of an arbitrary 
bound on algorithm runtimes, thus invalidating the optimal regret of the solver. In this paper, we propose 
a simpler framework for representing algorithm selection as a bandit problem, with partial information, 
and an unknown bound on losses. We adapt an existing solver to this game, proving a bound on its ex- 
pected regret, which holds also for the resulting algorithm selection technique. We present preliminary 
experiments with a set of SAT solvers on a mixed SAT-UNSAT benchmark. 

1 Introduction 

Decades of research in the fields of Machine Learning and Artificial Intelligence brought us a variety 
of alternative algorithms for solving many kinds of problems. Algorithms often display variability in 
performance quality, and computational cost, depending on the particular problem instance being solved: 
in other words, there is no single "best" algorithm. While a "trial and error" approach is still the most 
popular, attempts to automate algorithm selection are not new [33], and have grown to form a consistent
and dynamic field of research in the area of Meta-Learning [37]. Many selection methods follow an offline
learning scheme, in which the availability of a large training set of performance data for the different
algorithms is assumed. This data is used to learn a model that maps (problem, algorithm) pairs to expected
performance, or to some probability distribution on performance. The model is later used to select and 
run, for each new problem instance, only the algorithm that is expected to give the best results. While this 
approach might sound reasonable, it actually ignores the computational cost of the initial training phase: 
collecting a representative sample of performance data has to be done via solving a set of training problem 
instances, and each instance is solved repeatedly, at least once for each of the available algorithms, or more 
if the algorithms are randomized. Furthermore, these training instances are assumed to be representative of 
future ones, as the model is not updated after training. 

In other words, there is an obvious trade-off between the exploration of algorithm performances on 
different problem instances, aimed at learning the model, and the exploitation of the best algorithm/problem 
combinations, based on the model's predictions. This trade-off is typically ignored in offline algorithm 
selection, and the size of the training set is chosen heuristically. In our previous work [13, 14, 15], we have
kept an online view of algorithm selection, in which the only input available to the meta-learner is a set of 
algorithms, of unknown performance, and a sequence of problem instances that have to be solved. Rather 
than artificially subdividing the problem set into a training and a test set, we iteratively update the model 
each time an instance is solved, and use it to guide algorithm selection on the next instance. 

Bandit problems [3] offer a solid theoretical framework for dealing with the exploration-exploitation
trade-off in an online setting. One important obstacle to the straightforward application of a bandit
problem solver to algorithm selection is that most existing solvers assume a bound on losses to be available
beforehand. In [16, 15] we dealt with this issue heuristically, fixing the bound in advance. In this paper,
we introduce a modification of an existing bandit problem solver [7], which allows it to deal with an
unknown bound on losses, while retaining a bound on the expected regret. This allows us to propose a simpler
version of the algorithm selection framework GambleTA, originally introduced in [15]. The result is a
parameterless online algorithm selection method, the first, to our knowledge, with a provable upper bound
on regret.

The rest of the paper is organized as follows. Section 2 describes a tentative taxonomy of algorithm
selection methods, along with a few examples from literature. Section 3 presents our framework for
representing algorithm selection as a bandit problem, discussing the introduction of a higher level of selection
among different algorithm selection techniques (time allocators). Section 4 introduces the modified bandit
problem solver for unbounded loss games, along with its bound on regret. Section 5 describes experiments
with SAT solvers. Section 6 concludes the paper.

2 Related work 

In general terms, algorithm selection can be defined as the process of allocating computational resources 
to a set of alternative algorithms, in order to improve some measure of performance on a set of problem 
instances. Note that this definition includes parameter selection: the algorithm set can contain multiple
copies of the same algorithm, differing in their parameter settings, or even identical randomized algorithms
differing only in their random seeds. Algorithm selection techniques can be further described according to
different orthogonal features:

Decision vs. optimisation problems. A first distinction needs to be made among decision problems,
where a binary criterion for recognizing a solution is available; and optimisation problems, where different
levels of solution quality can be attained, measured by an objective function [22]. Literature on algorithm
selection is often focused on one of these two classes of problems. The selection is normally aimed at
minimizing solution time for decision problems; and at maximizing performance quality, or improving some
speed-quality trade-off, for optimisation problems.

Per set vs. per instance selection. The selection among different algorithms can be performed once for
an entire set of problem instances (per set selection, following [24]); or repeated for each instance (per
instance selection).

Static vs. dynamic selection. A further independent distinction [31] can be made among static algorithm
selection, in which allocation of resources precedes algorithm execution; and dynamic, or reactive,
algorithm selection, in which the allocation can be adapted during algorithm execution.

Oblivious vs. non-oblivious selection. In oblivious techniques, algorithm selection is performed from
scratch for each problem instance; in non-oblivious techniques, there is some knowledge transfer across
subsequent problem instances, usually in the form of a model of algorithm performance.

Offline vs. online learning. Non-oblivious techniques can be further distinguished as offline or batch
learning techniques, where a separate training phase is performed, after which the selection criteria are
kept fixed; and online techniques, where the criteria can be updated every time an instance is solved.

A seminal paper in the field of algorithm selection is [33], in which offline, per instance selection is
first proposed, for both decision and optimisation problems. More recently, similar concepts have been
proposed, with different terminology (algorithm recommendation, ranking, model selection), in the
Meta-Learning community [12, 37, 18]. Research in this field usually deals with optimisation problems, and
is focused on maximizing solution quality, without taking into account the computational aspect. Work
on Empirical Hardness Models [27, 30] is instead applied to decision problems, and focuses on obtaining
accurate models of runtime performance, conditioned on numerous features of the problem instances, as
well as on parameters of the solvers [24]. The models are used to perform algorithm selection on a per
instance basis, and are learned offline: online selection is advocated in [24]. Literature on algorithm
portfolios [23, 19, 32] is usually focused on choice criteria for building the set of candidate solvers, such that
their areas of good performance do not overlap, and optimal static allocation of computational resources
among elements of the portfolio.

A number of interesting dynamic exceptions to the static selection paradigm have been proposed
recently. In [25], algorithm performance modeling is based on the behavior of the candidate algorithms
during a predefined amount of time, called the observational horizon, and dynamic context-sensitive restart
policies for SAT solvers are presented. In both cases, the model is learned offline. In a Reinforcement
Learning [36] setting, algorithm selection can be formulated as a Markov Decision Process: in [26], the
algorithm set includes sequences of recursive algorithms, formed dynamically at run-time solving a
sequential decision problem, and a variation of Q-learning is used to find a dynamic algorithm selection policy;
the resulting technique is per instance, dynamic and online. In [31], a set of deterministic algorithms is
considered, and, under some limitations, static and dynamic schedules are obtained, based on dynamic
programming. In both cases, the method presented is per set, offline.

An approach based on runtime distributions can be found in [10, 11], for parallel independent processes
and shared resources respectively. The runtime distributions are assumed to be known, and the expected
value of a cost function, accounting for both wall-clock time and resources usage, is minimized. A
dynamic schedule is evaluated offline, using a branch-and-bound algorithm to find the optimal one in a tree
of possible schedules. Examples of allocation to two processes are presented with artificially generated
runtimes, and a real Latin square solver. Unfortunately, the computational complexity of the tree search
grows exponentially in the number of processes.

"Low-knowledge" oblivious approaches can be found in [4, 5], in which various simple indicators of
current solution improvement are used for algorithm selection, in order to achieve the best solution quality
within a given time contract. In [5], the selection process is dynamic: machine time shares are based on
a recency-weighted average of performance improvements. We adopted a similar approach in [13], where
we considered algorithms with a scalar state, that had to reach a target value. The time to solution was
estimated based on a shifting-window linear extrapolation of the learning curves.

For optimisation problems, if selection is only aimed at maximizing solution quality, the same problem 
instance can be solved multiple times, keeping only the best solution. In this case, algorithm selection can 
be represented as a Max K-armed bandit problem, a variant of the game in which the reward attributed to 
each arm is the maximum payoff on a set of rounds. Solvers for this game are used in [9, 35] to implement
oblivious per instance selection from a set of multi-start optimisation techniques: each problem is treated 
independently, and multiple runs of the available solvers are allocated, to maximize performance quality. 
Further references can be found in [15].

3 Algorithm selection as a bandit problem 

In its most basic form [34], the multi-armed bandit problem is faced by a gambler, playing a sequence
of trials against an $N$-armed slot machine. At each trial, the gambler chooses one of the available arms,
whose losses are randomly generated from different stationary distributions. The gambler incurs the
corresponding loss, and, in the full information game, she can observe the losses that would have been paid
pulling any of the other arms. A more optimistic formulation can be made in terms of positive rewards.
The aim of the game is to minimize the regret, defined as the difference between the cumulative loss of the
gambler, and the one of the best arm. A bandit problem solver (BPS) can be described as a mapping from
the history of the observed losses $l_j$ for each arm $j$, to a probability distribution $p = (p_1, \ldots, p_N)$, from
which the choice for the successive trial will be picked.
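
The definition of regret given above can be illustrated with a minimal sketch. The function name `regret` and the toy loss table are hypothetical, chosen only for illustration; in the partial information game the full table would of course not be visible to the gambler.

```python
def regret(loss_table, chosen):
    """Regret of a sequence of pulls with respect to the single best arm.

    loss_table[i][j] is the loss of arm j at trial i (hypothetical data);
    chosen[i] is the arm actually pulled at trial i.
    """
    incurred = sum(row[arm] for row, arm in zip(loss_table, chosen))
    n_arms = len(loss_table[0])
    # Cumulative loss of the best fixed arm over the whole sequence.
    best_arm_total = min(sum(row[j] for row in loss_table) for j in range(n_arms))
    return incurred - best_arm_total


loss_table = [[1.0, 3.0], [2.0, 0.5], [1.0, 4.0]]
print(regret(loss_table, [1, 1, 0]))  # prints 0.5
```

Here the gambler pays 4.5 in total, while always pulling arm 0 would have cost 4.0.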

More recently, the original restricting assumptions have been progressively relaxed, allowing for
non-stationary loss distributions, partial information (only the loss for the pulled arm is observed), and
adversarial bandits that can set their losses in order to deceive the player. In [3], a reward game is considered,
and no statistical assumptions are made about the process generating the rewards, which are allowed to be
an arbitrary function of the entire history of the game (non-oblivious adversarial setting). Based on these
pessimistic hypotheses, the authors describe probabilistic gambling strategies for the full and the partial
information games.

Let us now see how to represent algorithm selection for decision problems as a bandit problem, with
the aim of minimizing solution time. Consider a sequence $B = \{b_1, \ldots, b_M\}$ of $M$ instances of a decision
problem, for which we want to minimize solution time, and a set of $K$ algorithms $A = \{a_1, \ldots, a_K\}$,
such that each $b_m$ can be solved by each $a_k$. It is straightforward to describe static algorithm selection
in a multi-armed bandit setting, where "pick arm $k$" means "run algorithm $a_k$ on next problem instance".
Runtimes $t_k$ can be viewed as losses, generated by a rather complex mechanism, i.e., the algorithms
themselves, running on the current problem. The information is partial, as the runtime for other algorithms
is not available, unless we decide to solve the same problem instance again. In a worst case scenario one
can receive a "deceptive" problem sequence, starting with problem instances on which the performance of
the algorithms is misleading, so this bandit problem should be considered adversarial. As BPS typically
minimize the regret with respect to a single arm, this approach would allow us to implement per set selection
of the overall best algorithm. An example can be found in [16], where we presented an online method for
learning a per set estimate of an optimal restart strategy.

Unfortunately, per set selection is only profitable if one of the algorithms dominates the others on all 
problem instances. This is usually not the case: it is often observed in practice that different algorithms 
perform better on different problem instances. In this situation, a per instance selection scheme, which can 
take a different decision for each problem instance, can have a great advantage. 

One possible way of exploiting the nice theoretical properties of a BPS in the context of algorithm
selection, while allowing for the improvement in performance of per instance selection, is to use the BPS
at an upper level, to select among alternative algorithm selection techniques. Consider again the algorithm
selection problem represented by $B$ and $A$. Introduce a set of $N$ time allocators ($TA_j$) [13, 15]. Each $TA_j$
can be an arbitrary function, mapping the current history of collected performance data for each $a_k$, to a
share $s^{(j)} \in [0, 1]^K$, with $\sum_{k=1}^{K} s_k = 1$. A TA is used to solve a given problem instance executing all
algorithms in $A$ in parallel, on a single machine, whose computational resources are allocated to each $a_k$
proportionally to the corresponding $s_k$, such that for any portion of time spent $t$, $s_k t$ is used by $a_k$, as in a
static algorithm portfolio [23]. The runtime before a solution is found is then $\min_k \{t_k / s_k\}$, $t_k$ being the
runtime of algorithm $a_k$.
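
As a small illustration of the portfolio runtime $\min_k \{t_k / s_k\}$, the following sketch computes the wall-clock time of a static share. The function name `portfolio_runtime` is hypothetical, and the runtimes are made-up numbers; algorithms with a zero share are simply never run.

```python
import math


def portfolio_runtime(runtimes, shares):
    """Wall-clock time until the first algorithm of the portfolio finishes.

    runtimes[k] is the standalone runtime t_k of algorithm a_k (math.inf if it
    never terminates); shares[k] is its share s_k of the machine.
    """
    if abs(sum(shares) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    best = math.inf
    for t, s in zip(runtimes, shares):
        if s > 0:  # an algorithm with zero share never makes progress
            best = min(best, t / s)
    return best


print(portfolio_runtime([10.0, 4.0], [0.5, 0.5]))  # prints 8.0
print(portfolio_runtime([10.0, 4.0], [1.0, 0.0]))  # prints 10.0
```

The second call corresponds to single algorithm selection: the whole machine goes to the first algorithm, even though the second one would have been faster.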

A trivial example of a TA is the uniform time allocator, assigning a constant $s = (1/K, \ldots, 1/K)$. Single
algorithm selection can be represented in this framework by setting a single $s_k$ to 1. Dynamic allocators
will produce a time-varying share $s(t)$. In previous work, we presented examples of heuristic oblivious [13]
and non-oblivious [14] allocators; more sound TAs are proposed in [15], based on non-parametric models
of the runtime distribution of the algorithms, which are used to minimize the expected value of solution
time, or a quantile of this quantity, or to maximize solution probability within a given time contract.

At this higher level, one can use a BPS to select among different time allocators, $TA_1, TA_2, \ldots$, working
on the same algorithm set $A$. In this case, "pick arm $j$" means "use time allocator $TA_j$ on $A$ to solve next
problem instance". In the long term, the BPS would allow us to select, on a per set basis, the $TA_j$ that is best
at allocating time to algorithms in $A$ on a per instance basis. The resulting "Gambling" Time Allocator
(GambleTA) is described in Alg. 1.

Algorithm 1 GambleTA(A, T, BPS) Gambling Time Allocator.
    Algorithm set $A$ with $K$ algorithms;
    A set $T$ of $N$ time allocators $TA_j$;
    A bandit problem solver BPS;
    $M$ problem instances.

    initialize BPS($N$, $M$)
    for each problem $b_i$, $i = 1, \ldots, M$ do
        pick time allocator $I(i) = j$ with probability $p_j(i)$ from BPS.
        solve problem $b_i$ using $TA_{I(i)}$ on $A$
        incur loss $l_{I(i)}(i) = \min_k \{ t_k(i) / s_k^{(I(i))}(i) \}$
        update BPS
    end for

If BPS allows for non-stationary arms, it can also deal with time allocators that are learning to allocate
time. This is actually the original motivation for adopting this two-level selection scheme, as it allows
to combine in a principled way the exploration of algorithm behavior, which can be represented by the
uniform time allocator, and the exploitation of this information by a model-based allocator, whose model is
being learned online, based on results on the sequence of problems met so far. If more time allocators are
available, they can be made to compete, using the BPS to explore their performances. Another interesting
feature of this selection scheme is that the initial requirement that each algorithm should be capable of
solving each problem can be relaxed, requiring instead that at least one of the $a_k$ can solve a given $b_m$, and
that each $TA_j$ can solve each $b_m$; this can be ensured in practice by imposing $s_k > 0$ for all $a_k$. This
allows the use of interesting combinations of complete and incomplete solvers in $A$ (see Sect. 5). Note that any
bound on the regret of the BPS will determine a bound on the regret of GambleTA with respect to the
best time allocator. Nothing can be said about the performance w.r.t. the best algorithm. In a worst-case
setting, if none of the time allocators is effective, a bound can still be obtained by including the uniform
share in the set of TAs. In practice, though, per-instance selection can be much more efficient than uniform
allocation, and the literature is full of examples of time allocators which eventually converge to a good
performance.
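
The control flow of GambleTA can be sketched in a few lines of Python. This is only a simulation under stated assumptions: problem instances are represented by hypothetical lists of known standalone runtimes, time allocators are plain functions returning a static share, and `UniformBPS` is a stand-in solver with an assumed `distribution`/`update` interface, not one of the solvers discussed in this paper.

```python
import random


class UniformBPS:
    """Hypothetical stand-in solver: always proposes the uniform distribution."""

    def __init__(self, n_arms):
        self.n_arms = n_arms

    def distribution(self):
        return [1.0 / self.n_arms] * self.n_arms

    def update(self, arm, loss):
        pass  # a real solver would use the observed loss to update its weights


def gamble_ta(instances, allocators, bps, rng):
    """Run the two-level selection loop; return (allocator index, loss) per instance.

    instances: list of per-algorithm runtime lists (simulated, assumed known here);
    allocators: list of functions mapping the history to a share vector.
    """
    history = []  # (share, loss) pairs observed so far
    results = []
    for runtimes in instances:
        p = bps.distribution()
        j = rng.choices(range(len(allocators)), weights=p)[0]
        shares = allocators[j](history)
        # Portfolio runtime: the first algorithm to finish under this share.
        loss = min(t / s for t, s in zip(runtimes, shares) if s > 0)
        bps.update(j, loss)
        history.append((shares, loss))
        results.append((j, loss))
    return results


def uniform_allocator(history):
    return [0.5, 0.5]


def first_only_allocator(history):
    return [1.0, 0.0]


runs = gamble_ta([[10.0, 4.0], [2.0, 6.0]],
                 [uniform_allocator, first_only_allocator],
                 UniformBPS(2), random.Random(0))
print(runs)
```

Any solver exposing the same two methods could be plugged in instead of `UniformBPS`; the loop itself does not depend on how the distribution over allocators is produced.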

The original version of GambleTA (GambleTA4 in the following) [15] was based on a more
complex alternative, inspired by the bandit problem with expert advice, as described in [2, 3]. In that setting,
two games are going on in parallel: at a lower level, a partial information game is played, based on the
probability distribution obtained mixing the advice of different experts, represented as probability
distributions on the $K$ arms. The experts can be arbitrary functions of the history of observed rewards, and
give a different advice for each trial. At a higher level, a full information game is played, with the $N$
experts playing the roles of the different arms. The probability distribution $p$ at this level is not used to
pick a single expert, but to mix their advice, in order to generate the distribution for the lower level arms.
In GambleTA4, the time allocators play the role of the experts, each suggesting a different $s$, on a per
instance basis; and the arms of the lower level game are the $K$ algorithms, to be run in parallel with the
mixture share. Exp4 [2, 3] is used as the BPS. Unfortunately, the bounds for Exp4 cannot be extended
to GambleTA4 in a straightforward manner, as the loss function itself is not convex; moreover, Exp4
cannot deal with unbounded losses, so we had to adopt a heuristic reward attribution instead of using the
plain runtimes.

A common issue of the above approaches is the difficulty of setting reasonable upper bounds on the
time required by the algorithms. This renders a straightforward application of most BPS problematic, as a
known bound on losses is usually assumed, and used to tune parameters of the solver. Underestimating this
bound can invalidate the bounds on regret, while overestimating it can produce an excessively "cautious"
algorithm, with a poor performance. Setting in advance a good bound is particularly difficult when dealing
with algorithm runtimes, which can easily exhibit variations of several orders of magnitude among different
problem instances, or even among different runs on the same instance [20].

Some interesting results regarding games with unbounded losses have recently been obtained. In [7, 8],
the authors consider a full information game, and provide two algorithms which can adapt to unknown
bounds on signed rewards. Based on this work, [1] provide a Hannan consistent algorithm for losses whose
bound grows in the number of trials $i$ with a known rate $i^{\nu}$, $\nu < 1/2$. This latter hypothesis does not fit well
our situation, as we would like to avoid any restriction on the sequence of problems: a very hard instance
can be met first, followed by an easy one. In this sense, the hypothesis of a constant, but unknown, bound is
more suited. In [7], Cesa-Bianchi et al. also introduce an algorithm for loss games with partial information
(Exp3Light), which requires losses to be bounded, and is particularly effective when the cumulative loss of
the best arm is small. In the next section we introduce a variation of this algorithm that allows it to deal
with an unknown bound on losses.

4 An algorithm for games with an unknown bound on losses 

Here and in the following, we consider a partial information game with $N$ arms, and $M$ trials; an index $(i)$
indicates the value of a quantity used or observed at trial $i \in \{1, \ldots, M\}$; $j$ indicates quantities related to
the $j$-th arm, $j \in \{1, \ldots, N\}$; index $E$ refers to the loss incurred by the bandit problem solver, and $I(i)$
indicates the arm chosen at trial $(i)$, so it is a discrete random variable with value in $\{1, \ldots, N\}$; $r$, $u$ will
represent quantities related to an epoch of the game, which consists of a sequence of consecutive
trials; log with no index is the natural logarithm.

Exp3Light [7, Sec. 4] is a solver for the bandit loss game with partial information. It is a modified
version of the weighted majority algorithm [29], in which the cumulative losses for each arm are obtained
through an unbiased estimate¹. The game consists of a sequence of epochs $r = 0, 1, \ldots$: in each epoch,
the probability distribution over the arms is updated, proportional to $\exp(-\eta_r \tilde{L}_j)$, $\tilde{L}_j$ being the current
unbiased estimate of the cumulative loss. Assuming an upper bound $4^r$ on the smallest loss estimate, $\eta_r$ is
set as:

(1)

When this bound is first exceeded, a new epoch starts and $r$ and $\eta_r$ are updated accordingly.

The original algorithm assumes losses in $[0, 1]$. We first consider a game with a known finite bound
$C$ on losses, and introduce a slightly modified version of Exp3Light (Algorithm 2), obtained simply by
dividing all losses by $C$. Based on Theorem 5 from [7], it is easy to prove the following.

Theorem 1. If $L^*(M)$ is the loss of the best arm after $M$ trials, and $L_E(M) = \sum_{i=1}^{M} l_E(i)$ is the loss
of Exp3Light($N$, $M$, $C$), the expected value of its regret is bounded as:

$$E\{L_E(M)\} - L^*(M) \le 2\sqrt{6 C (\log N + N \log M) N L^*(M)} + C \left[ 2\sqrt{2 C (\log N + N \log M) N} + (2N + 1)(1 + \log_4(3M + 1)) \right] \tag{2}$$

The proof is trivial, and is given in the appendix.

We now introduce a simple variation of Algorithm 2 which does not require the knowledge of the
bound $C$ on losses, and uses Algorithm 2 as a subroutine. Exp3Light-A (Algorithm 3) is inspired by
the doubling trick used in [7] for a full information game with unknown bound on losses. The game is
again organized in a sequence of epochs $u = 0, 1, \ldots$: in each epoch, Algorithm 2 is restarted using a
bound $C_u = 2^u$; a new epoch is started with the appropriate $u$ whenever a loss larger than the current $C_u$
is observed.

¹ For a given round, and a given arm with loss $l$ and pull probability $p$, the estimated loss $\tilde{l}$ is $l/p$ if the arm is pulled, 0 otherwise.
This estimate is unbiased in the sense that its expected value, with respect to the process extracting the arm to be pulled, equals the
actual value of the loss: $E\{\tilde{l}\} = p \, l/p + (1 - p) \, 0 = l$.
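
The unbiasedness of the importance-weighted loss estimate can be checked exactly with rational arithmetic. The sketch below is only illustrative: the function name and the toy losses and probabilities are hypothetical.

```python
from fractions import Fraction


def expected_estimate(losses, probs, arm):
    """Expected value of the loss estimate for `arm`, averaged over the pulled arm.

    The estimate is losses[arm] / probs[arm] when `arm` is pulled, 0 otherwise.
    """
    total = Fraction(0)
    for pulled, p in enumerate(probs):
        estimate = losses[arm] / probs[arm] if pulled == arm else Fraction(0)
        total += p * estimate
    return total


losses = [Fraction(3), Fraction(1, 2), Fraction(7)]
probs = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
for j in range(3):
    # Each expected estimate coincides with the true loss of the arm.
    print(j, expected_estimate(losses, probs, j), losses[j])
```

Note that the estimate itself can be much larger than the true loss when the pull probability is small, which is why the variance of these estimates matters in the analysis.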
Algorithm 2 Exp3Light($N$, $M$, $C$) A solver for bandit problems with partial information and a known
bound $C$ on losses.
    $N$ arms, $M$ trials
    losses $l_j(i) \in [0, C]$ $\forall i = 1, \ldots, M$, $j = 1, \ldots, N$
    initialize epoch $r = 0$, $L_E = \tilde{L}_j(0) = 0$.
    initialize $\eta_r$ according to (1)
    for each trial $i = 1, \ldots, M$ do
        set $p_j(i) \propto \exp(-\eta_r \tilde{L}_j(i-1) / C)$, $\sum_{j=1}^{N} p_j(i) = 1$.
        pick arm $I(i) = j$ with probability $p_j(i)$.
        incur loss $l_E(i) = l_{I(i)}(i)$.
        evaluate unbiased loss estimates:
            $\tilde{l}_{I(i)}(i) = l_{I(i)}(i) / p_{I(i)}(i)$, $\tilde{l}_j(i) = 0$ for $j \ne I(i)$
        update cumulative losses:
            $L_E(i) = L_E(i-1) + l_E(i)$,
            $\tilde{L}_j(i) = \tilde{L}_j(i-1) + \tilde{l}_j(i)$, for $j = 1, \ldots, N$
            $\tilde{L}^*(i) = \min_j \tilde{L}_j(i)$.
        if $(\tilde{L}^*(i) / C) > 4^r$ then
            start next epoch $r = \lceil \log_4(\tilde{L}^*(i) / C) \rceil$
            update $\eta_r$ according to (1)
        end if
    end for


Algorithm 3 Exp3Light-A($N$, $M$) A solver for bandit problems with partial information and an unknown
(but finite) bound on losses.
    $N$ arms, $M$ trials,
    losses $l_j(i) \in [0, C]$ $\forall i = 1, \ldots, M$, $j = 1, \ldots, N$
    unknown $C < \infty$
    initialize epoch $u = 0$, Exp3Light($N$, $M$, $2^u$)
    for each trial $i = 1, \ldots, M$ do
        pick arm $I(i) = j$ with probability $p_j(i)$ from Exp3Light.
        incur loss $l_E(i) = l_{I(i)}(i)$.
        if $l_{I(i)}(i) > 2^u$ then
            start next epoch $u = \lceil \log_2 l_{I(i)}(i) \rceil$
            restart Exp3Light($N$, $M - i$, $2^u$)
        end if
    end for
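
The restart logic of Algorithm 3 is independent of the inner solver, and can be sketched as a generic wrapper. In the sketch below, `UniformSolver` is a hypothetical stand-in for the inner solver, the `make_solver(n_arms, bound)` factory interface is our own assumption, and the remaining number of trials is not passed to the restarted solver, for brevity.

```python
import math
import random


class UniformSolver:
    """Hypothetical stand-in for the inner solver; it ignores all feedback."""

    def __init__(self, n_arms, bound):
        self.n_arms = n_arms
        self.bound = bound

    def pick(self, rng):
        return rng.randrange(self.n_arms)

    def update(self, arm, loss):
        pass


class UnknownBoundWrapper:
    """Doubling trick: restart the inner solver whenever a loss exceeds 2**u."""

    def __init__(self, n_arms, make_solver, rng=None):
        self.n_arms = n_arms
        self.make_solver = make_solver
        self.rng = rng or random.Random(0)
        self.u = 0          # current epoch; the current bound is 2**u
        self.restarts = 0
        self.solver = make_solver(n_arms, 2.0 ** self.u)

    def pick(self):
        return self.solver.pick(self.rng)

    def update(self, arm, loss):
        if loss > 2.0 ** self.u:
            # The loss violates the current bound: open a new epoch and
            # restart the inner solver, discarding what it had learned.
            self.u = math.ceil(math.log2(loss))
            self.solver = self.make_solver(self.n_arms, 2.0 ** self.u)
            self.restarts += 1
        else:
            self.solver.update(arm, loss)


wrapper = UnknownBoundWrapper(3, UniformSolver, random.Random(0))
for loss in [0.5, 5.0, 8.0, 9.0]:
    arm = wrapper.pick()
    wrapper.update(arm, loss)
    print(loss, wrapper.u)
```

With the losses above the epochs visited are 0, 3 and 4: epoch numbers increase, but need not be consecutive.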

Theorem 2. If $L^*(M)$ is the loss of the best arm after $M$ trials, and $C < \infty$ is the unknown bound on
losses, the expected value of the regret of Exp3Light-A($N$, $M$) is bounded as:

$$E\{L_E(M)\} - L^*(M) \le 4\sqrt{3 \lceil \log_2 C \rceil C (\log N + N \log M) N L^*(M)} + 2 \lceil \log_2 C \rceil C \left[ \sqrt{4 C (\log N + N \log M) N} + (2N + 1)(1 + \log_4(3M + 1)) + 2 \right] \tag{3}$$

The proof is given in the appendix. The regret obtained by Exp3Light-A is $O(\sqrt{C N \log M \, L^*(M)})$,
which can be useful in a situation in which $C$ is high but $L^*$ is relatively small, as we expect in our time
allocation setting if the algorithms exhibit huge variations in runtime, but at least one of the TAs eventually
converges to a good performance. We can then use Exp3Light-A as a BPS for selecting among different
time allocators in GambleTA (Algorithm 1).

5 Experiments 

The set of time allocators used in the following experiments is the same as in [15], and includes the uniform
allocator, along with nine other dynamic allocators, optimizing different quantiles of runtime, based on
a nonparametric model of the runtime distribution that is updated after each problem is solved. We first
briefly describe these time allocators, inviting the reader to refer to [15] for further details and a deeper
discussion. A separate model $F_k(t|x)$, conditioned on features $x$ of the problem instance, is used for each
algorithm $a_k$. Based on these models, the runtime distribution for the whole algorithm portfolio $A$ can be
evaluated for an arbitrary share $s \in [0, 1]^K$, with $\sum_{k=1}^{K} s_k = 1$, as

$$F_{A,s}(t) = 1 - \prod_{k=1}^{K} [1 - F_k(s_k t)]. \tag{4}$$

Eq. (4) can be used to evaluate a quantile $t_{A,s}(\alpha) = F_{A,s}^{-1}(\alpha)$ for a given solution probability $\alpha$. Fixing
this value, time is allocated using the share that minimizes the quantile

$$s = \arg\min_s F_{A,s}^{-1}(\alpha). \tag{5}$$

Compared to minimizing expected runtime, this time allocator has the advantage of being applicable even
when the runtime distributions are improper, i.e. $F(\infty) < 1$, as in the case of incomplete solvers. A
dynamic version of this time allocator is obtained updating the share value periodically, conditioning each
$F_k$ on the time spent so far by the corresponding $a_k$.
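
The quantile-based allocation can be sketched as follows for a portfolio of two algorithms. This is an illustration under loud assumptions: the runtime distributions are taken to be known exponential CDFs rather than learned nonparametric models, the quantile is found by bisection, and the share is chosen by a plain grid search; all function names are hypothetical.

```python
import math


def portfolio_cdf(cdfs, shares, t):
    """Probability that the portfolio has found a solution by wall-clock time t."""
    survival = 1.0
    for cdf, s in zip(cdfs, shares):
        survival *= 1.0 - cdf(s * t)
    return 1.0 - survival


def portfolio_quantile(cdfs, shares, alpha, t_max=1e6, tol=1e-6):
    """Smallest t (up to tol) with portfolio_cdf >= alpha, or inf if unreachable."""
    if portfolio_cdf(cdfs, shares, t_max) < alpha:
        return math.inf
    lo, hi = 0.0, t_max
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if portfolio_cdf(cdfs, shares, mid) >= alpha:
            hi = mid
        else:
            lo = mid
    return hi


def best_share_two(cdfs, alpha, steps=100):
    """Grid search over shares (s, 1 - s) minimizing the alpha-quantile."""
    best = None
    for i in range(steps + 1):
        shares = (i / steps, 1 - i / steps)
        q = portfolio_quantile(cdfs, shares, alpha)
        if best is None or q < best[1]:
            best = (shares, q)
    return best


def fast(t):  # assumed exponential runtime distribution, rate 1
    return 1.0 - math.exp(-t)


def slow(t):  # assumed exponential runtime distribution, rate 0.1
    return 1.0 - math.exp(-0.1 * t)


print(best_share_two([fast, slow], 0.5))
```

With two exponential distributions the median is minimized by giving the whole machine to the faster algorithm; more interesting shares arise when the distributions cross or are improper.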

Rather than fixing an arbitrary $\alpha$, we used nine different instances of this time allocator, with $\alpha$ ranging
from 0.1 to 0.9, in addition to the uniform allocator, and let the BPS select the best one.

We present experiments for the algorithm selection scenario from [15], in which a local search and a
complete SAT solver (respectively, G2-WSAT [28] and Satz-Rand [20]) are combined to solve a sequence
of random satisfiable and unsatisfiable problems (benchmarks uf-*, uu-* from [21], 1899 instances in
total). As the clauses-to-variable ratio is fixed in this benchmark, only the number of variables, ranging
from 20 to 250, was used as a problem feature $x$. Local search algorithms are more efficient on
satisfiable instances, but cannot prove unsatisfiability, so are doomed to run forever on unsatisfiable instances;
while complete solvers are guaranteed to terminate their execution on all instances, as they can also prove
unsatisfiability.

For the whole problem sequence, the overhead of GambleTA3 (Algorithm 1, using Exp3Light-A
as the BPS) over an ideal "oracle", which can predict and run only the fastest algorithm, is 22%.
GambleTA4 (from [15], based on Exp4) seems to profit from the mixing of time allocation shares, obtaining
a better 14%. Satz-Rand alone can solve all the problems, but with an overhead of about 40% w.r.t. the
oracle, due to its poor performance on satisfiable instances. Fig. 1 plots the evolution of cumulative time,
and cumulative overhead, along the problem sequence.

Figure 1: (a): Cumulative time spent by GambleTA3 and GambleTA4 [15] on the SAT-UNSAT problem set
(10^ ≈ 1 min.). Upper 95% confidence bounds on 20 runs, with random reordering of the problems. Oracle is the
lower bound on performance. Uniform is the (0.5, 0.5) share. Satz-Rand is the per-set best algorithm. (b): The
evolution of cumulative overhead, defined as $(\sum_j t_G(j) - \sum_j t_O(j)) / \sum_j t_O(j)$, where $t_G$ is the performance of
GambleTA and $t_O$ is the performance of the oracle. Dotted lines represent 95% confidence bounds.
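
For concreteness, the cumulative overhead with respect to the oracle can be computed as in the following sketch. We assume here that the overhead after each instance is the difference of the cumulative times divided by the cumulative time of the oracle; the function name and the runtimes are hypothetical, not taken from the experiments.

```python
from itertools import accumulate


def cumulative_overhead(selector_times, oracle_times):
    """Relative overhead of a selector over the oracle after each instance."""
    selector_total = accumulate(selector_times)
    oracle_total = accumulate(oracle_times)
    return [(s - o) / o for s, o in zip(selector_total, oracle_total)]


# Made-up per-instance times for a selector and for the oracle.
print(cumulative_overhead([3.0, 5.0, 4.0], [2.0, 4.0, 4.0]))
```

In this toy sequence the overhead decreases from 50% to 20%, as the selector matches the oracle on the last instance.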



6 Conclusions 

We presented a bandit problem solver for loss games with partial information and an unknown bound on
losses. The solver represents an ideal plug-in for our algorithm selection method GambleTA, avoiding
the need to set any additional parameter. The choice of the algorithm set and time allocators to use is still
left to the user. Any existing selection technique, including oblivious ones, can be included in the set of $N$
allocators, with an impact $O(\sqrt{N})$ on the regret: the overall performance of GambleTA will converge to
the one of the best time allocator. Preliminary experiments showed a degradation in performance compared
to the heuristic version presented in [15], which requires setting in advance a maximum runtime, and cannot
be given a bound on regret.

According to [6], a bound for the original Exp3Light can be proved for an adaptive $\eta_r$ in which
the total number of trials $M$ is replaced by the current trial $i$. This should allow for a potentially more
efficient variation of Exp3Light-A, in which Exp3Light is not restarted at each epoch, and can retain
the information on past losses.

One potential advantage of offline selection methods is that the initial training phase can be easily
parallelized, distributing the workload on a cluster of machines. Ongoing research aims at extending
GambleTA to allocate multiple CPUs in parallel, in order to obtain a fully distributed algorithm selection
framework [17].

Appendix

A.1 Proof of Theorem 1

The proof is trivially based on the regret for the original Exp3Light, with $C = 1$, which according to [7,
Theorem 5] (proof obtained from [6]) can be evaluated using the optimal values (1) for $\eta_r$:

$$E\{L_E(M)\} - L^*(M) \le 2\sqrt{2 (\log N + N \log M) N (1 + 3 L^*(M))} + (2N + 1)(1 + \log_4(3M + 1)). \tag{6}$$

As we are playing the same game normalizing all losses with $C$, the following will hold for Alg. 2:

\begin{align}
\frac{E\{L_E(M)\} - L^*(M)}{C} &\le \tag{7} \\
& 2\sqrt{2 (\log N + N \log M) N (1 + 3 L^*(M)/C)} \tag{8} \\
& + (2N + 1)(1 + \log_4(3M + 1)). \tag{9}
\end{align}

Multiplying both sides by $C$ and rearranging produces (2). □

A.2 Proof of Theorem 2

This follows the proof technique employed in [7, Theorem 4]. Let $i_u$ be the last trial of epoch $u$, i.e. the first
trial at which a loss $l_E(i) > 2^u$ is observed. Write cumulative losses during an epoch $u$, excluding the
last trial $i_u$, as $L_E^{(u)} = \sum_{i=i_{u-1}+1}^{i_u - 1} l_E(i)$, and let $L^{*(u)} = \min_j \sum_{i=i_{u-1}+1}^{i_u - 1} l_j(i)$ indicate the optimal loss for
this subset of trials. Let $U = u(M)$ be the a priori unknown epoch at the last trial. In each epoch $u$, the bound
(2) holds with $C_u = 2^u$ for all trials except the last one $i_u$, so noting that $\log(M - i) \le \log(M)$ we can
write:

$$E\{L_E^{(u)}\} - L^{*(u)} \le 2\sqrt{6 C_u (\log N + N \log M) N L^{*(u)}} + C_u \left[ 2\sqrt{2 C_u (\log N + N \log M) N} + (2N + 1)(1 + \log_4(3M + 1)) \right]. \tag{10}$$

The loss for trial $i_u$ can only be bounded by the next value of $C_u$, evaluated a posteriori:

$$E\{l_E(i_u)\} - l^*(i_u) \le C_{u+1}, \tag{11}$$
where $l^*(i) = \min_j l_j(i)$ indicates the optimal loss at trial $i$.

Combining (10, 11), and writing $i_{-1} = 0$, $i_U = M$, we obtain the regret for the whole game²:

$$E\{L_E(M)\} - \sum_{u=0}^{U} L^{*(u)} - \sum_{u=0}^{U} l^*(i_u) \le \sum_{u=0}^{U} \left\{ 2\sqrt{6 C_u (\log N + N \log M) N L^{*(u)}} + C_u \left[ 2\sqrt{2 C_u (\log N + N \log M) N} + (2N + 1)(1 + \log_4(3M + 1)) \right] \right\} + \sum_{u=0}^{U} C_{u+1}. \tag{12}$$

The first term on the right hand side of (12) can be bounded using Jensen's inequality

$$\sum_{u=0}^{U} \sqrt{a_u} \le \sqrt{(U + 1) \sum_{u=0}^{U} a_u},$$

with

$$a_u = 24 C_u (\log N + N \log M) N L^{*(u)} \le 24 C_{U+1} (\log N + N \log M) N L^{*(u)}. \tag{13}$$

The other terms do not depend on the optimal losses and can also be bounded noting that $C_u \le C_{U+1}$.

We now have to bound the number of epochs $U$. This can be done noting that the maximum observed
loss cannot be larger than the unknown, but finite, bound $C$, and that

$$U + 1 = \lceil \log_2 \max_i l_{I(i)}(i) \rceil \le \lceil \log_2 C \rceil, \tag{14}$$

which implies

$$C_{U+1} = 2^{U+1} \le 2C. \tag{15}$$

In this way we can bound the sum

$$\sum_{u=0}^{U} C_{u+1} \le \sum_{u=0}^{\lceil \log_2 C \rceil} 2^u \le 2^{1 + \lceil \log_2 C \rceil} \le 4C. \tag{16}$$

We conclude by noting that

$$L^*(M) = \min_j L_j(M) \ge \sum_{u=0}^{U} L^{*(u)} + \sum_{u=0}^{U} l^*(i_u) \ge \sum_{u=0}^{U} L^{*(u)}. \tag{17}$$

Inequality (12) then becomes:

$$E\{L_E(M)\} - L^*(M) \le 2\sqrt{6 (U + 1) C_{U+1} (\log N + N \log M) N L^*(M)} + (U + 1) C_{U+1} \left[ 2\sqrt{2 C_{U+1} (\log N + N \log M) N} + (2N + 1)(1 + \log_4(3M + 1)) \right] + 4C.$$

Plugging in (14, 15) and rearranging we obtain (3). □

² Note that all cumulative losses are counted from trial $i_{u-1} + 1$ to trial $i_u - 1$. If an epoch ends on its first trial, (10) is zero, and
(11) holds. Writing $i_U = M$ implies the worst case hypothesis that the bound $C_U$ is exceeded on the last trial. Epoch numbers $u$ are
increasing, but not necessarily consecutive: in this case the terms related to the missing epochs are 0.

Technical Report No. IDSIA-07-08



